{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chapter 15\n",
    "---\n",
    "# K-Nearest Neighbors\n",
    "\n",
    "An observation is predicted to be the class of that of the largest proportion of the k-nearest observations.\n",
    "\n",
    "## 15.1 Finding an Observation's Nearest Neighbors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[[1.03800476, 0.56925129, 1.10395287, 1.1850097 ],\n",
       "        [0.79566902, 0.33784833, 0.76275864, 1.05353673]]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.neighbors import NearestNeighbors\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "\n",
    "standardizer = StandardScaler()\n",
    "\n",
    "features_standardized = standardizer.fit_transform(features)\n",
    "\n",
    "nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)\n",
    "#nearest_neighbors_euclidian = NearestNeighbors(n_neighbors=2, metric='euclidian').fit(features_standardized)\n",
    "new_observation = [1, 1, 1, 1]\n",
    "\n",
    "distances, indices = nearest_neighbors.kneighbors([new_observation])\n",
    "\n",
    "features_standardized[indices]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "\n",
    "How do we measure distance?\n",
    "\n",
    "* Euclidian\n",
    "$$\n",
    "d_{euclidean} = \\sqrt{\\sum_{i=1}^{n}{(x_i - y_i)^2}}\n",
    "$$\n",
    "\n",
    "* Manhattan\n",
    "$$\n",
    "d_{manhattan} = \\sum_{i=1}^{n}{|x_i - y_i|}\n",
    "$$\n",
    "\n",
    "* Minkowski (default)\n",
    "$$\n",
    "d_{minkowski} = (\\sum_{i=1}^{n}{|x_i - y_i|^p})^{\\frac{1}{p}}\n",
    "$$\n",
    "## 15.2 Creating a K-Nearest Neighbor Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn import datasets\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "X = iris.data\n",
    "y = iris.target\n",
    "\n",
    "standardizer = StandardScaler()\n",
    "\n",
    "X_std = standardizer.fit_transform(X)\n",
    "\n",
    "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)\n",
    "\n",
    "new_observations = [[0.75, 0.75, 0.75, 0.75],\n",
    "                   [1, 1, 1, 1]]\n",
    "\n",
    "knn.predict(new_observations)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "In KNN, given an observation $x_u$, with an unknown target class, the algorithm first identifies the k closest observations (sometimes called $x_u$'s neighborhood) based on some distance metric, then these k observations \"vote\" based on their class and the class that wins the vote is $x_u$'s predicted class. More formally, the probability $x_u$ is some class j is:\n",
    "$$\n",
    "\\frac{1}{k} \\sum_{i \\in v}{I(y_i = j)}\n",
    "$$\n",
    "where v is the k observatoin in $x_u$'s neighborhood, $y_i$ is the class of the ith observation, and I is an indicator function (i.e., 1 is true, 0 otherwise). In scikit-learn we can see these probabilities using `predict_proba`\n",
    "\n",
    "## 15.3 Identifying the Best Neighborhood Size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import Pipeline, FeatureUnion\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "target = iris.target\n",
    "\n",
    "standardizer = StandardScaler()\n",
    "features_standardized = standardizer.fit_transform(features)\n",
    "\n",
    "knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)\n",
    "\n",
    "pipe = Pipeline([(\"standardizer\", standardizer), (\"knn\", knn)])\n",
    "\n",
    "search_space = [{\"knn__n_neighbors\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]\n",
    "\n",
    "classifier = GridSearchCV(\n",
    "    pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)\n",
    "\n",
    "classifier.best_estimator_.get_params()[\"knn__n_neighbors\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 15.4 Creating a Radius-Based Nearest Neighbor Classifier\n",
    "given an observation of unknown class, you need to predict its class based on the class of all observations within a certain distance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.neighbors import RadiusNeighborsClassifier\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn import datasets\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "target = iris.target\n",
    "\n",
    "standardizer = StandardScaler()\n",
    "features_standardized = standardizer.fit_transform(features)\n",
    "\n",
    "rnn = RadiusNeighborsClassifier(\n",
    "    radius=.5, n_jobs=-1).fit(features_standardized, target)\n",
    "\n",
    "new_observations = [[1, 1, 1, 1]]\n",
    "\n",
    "rnn.predict(new_observations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:machine_learning_cookbook]",
   "language": "python",
   "name": "conda-env-machine_learning_cookbook-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
